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Abstract 

The runtime for Kernel Partial Least Squares (KPLS) to compute the fit is quadratic 
in the number of examples. However, the necessity of obtaining sensitivity measures as 
degrees of freedom for model selection or confidence intervals for more detailed analysis 
requires cubic runtime, and thus constitutes a computational bottleneck in real-world data 
analysis. We propose a novel algorithm for KPLS which not only computes (a) the fit, 
but also (b) its approximate degrees of freedom and (c) error bars in quadratic runtime. 
The algorithm exploits a close connection between Kernel PLS and the Lanczos algorithm 
for approximating the eigenvalues of symmetric matrices, and uses this approximation to 
compute the trace of powers of the kernel matrix in quadratic runtime. 

1 INTRODUCTION 

Partial Least Squares (PLS) [281 [29] is a supervised dimensionality reduction technique. Given 
n observations (x{, yj) GK^xl, it iteratively constructs an orthogonal set T = (t%, . . . , t m ) 6 
B>nxm of m latent features which have maximal covariance with the target variable y = 
(yi, . . . , y n ). For regression, these latent components are used as predictors in a least squares 
fit instead of the original data leading to fitted values 

y m = T(T T Ty 1 T T y = r T y, (1) 

where V denotes the orthogonal projection operator. PLS is the standard tool e.g. in 
chemometrics |16| . and has been successfully applied in various other scientific fields such 
as chemoinformatics, physiology or bioinformatics [25 } I23 |. fTj . In combination with the kernel 
trick [201 122| . Kernel Partial Least Squares (KPLS) performs dimensionality reduction and 
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regression in a non-linear fashion. KPLS has some appealing properties over existing kernel 
methods. Due to its iterative nature, it only relies on matrix-vector multiplications. Hence 
its runtime is quadratic in the number of training examples , as opposed to - for example - 
Kernel Ridge Regression, which requires the inversion of a large symmetric matrix, having 
time complexity C(n 3 ). Furthermore, it is possible to compute the derivative of the KPLS 
solution with respect to y by differentiating the iterative formulation itself. Taking the trace 
of the derivative of the fitted values, one obtains an estimate of the degrees of freedom for 
KPLS, which can be used, for example, for effective model selection based on information 
criteria like AIC, BIC, or gMDL [13]. The first order Taylor approximation can also be used 
to construct confidence intervals for PLS [51 [19]. However, since we take the derivative of 
a vector ([T]), the derivative is a matrix, and the computation of the derivative involves a 
number of matrix-matrix multiplications which have time complexity 0(n 3 ) for all practical 
considerations. 

In this work, we propose an algorithm which computes (a) the fit of KPLS as well as 
(b) its approximate degrees of freedom and (c) confidence intervals for the KPLS solutions, 
all in quadratic runtime. These results are based on the fact that PLS is equivalent to the 
Lanczos method for approximating the eigenvalues of the kernel matrix K by the eigenvalues 
of a tridiagonal m X m matrix D. The main contribution is to compute these approximate 
eigenvalues using KPLS itself. Then, using a different formulation of the derivative of the 
fit in KPLS, one can approximate the trace of powers K 3 of the kernel matrix using the 
matrix D. Since D is typically small (as it scales with the number of components), the 
runtime for computing the eigenvalues is cubic in m, and therefore, unproblematic. Since the 
powers of the Kernel matrices K J are the only matrix-matrix multiplications of order n in 
the formula for the degrees of freedom, the approximation leads to quadratic runtime. Hence, 
we use the KPLS fit to approximate its degrees of freedom. In addition, using the alternative 
formulation of the derivative, one can perform a sensitivity analysis of KPLS resulting in 
confidence intervals on the estimates, also in quadratic runtime. 

This paper is structured as follows. In Section [21 we review the connection between KPLS 
and Lanczos approximations, and summarize the state-of-the-art for computing the derivative 
of Kernel PLS. In Sections [3] and [U we propose our novel formulation of the derivative together 
with the quadratic runtime algorithms for the degrees of freedom and the confidence intervals. 
We conclude with some practical examples. 

PLS is closely related to Krylov methods. Therefore, we briefly recall the definition 
of Krylov subspaces. For a matrix C £ R cxc and c £ M c , we call the set of vectors 
c, Cc, . . . ,C m_1 c the Krylov sequence of length m. The space spanned by these vectors 
is called a Krylov space and is denoted by fC m (C, c). 



2 BACKGROUND: PLS, LANCZOS METHODS, AND SEN- 
SITIVITY ANALYSIS 



In this paper, we focus on the NIPALS algorithm [28] for PLS. For different forms of PLS, 

see [21]. The n centered observations (xi,yi) are pooled into a n x d data matrix X and a 
vector y 6 W l . PLS constructs m orthogonal latent components T = (t\, . . . ,t m ) € M. nxm in 
a greedy fashion. The first component t\ = Xw\ fulfills 




w\ = arg max cov(Xw,y) 2 



1 




(2) 
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Subsequent components £2, £3, • • • are chosen such that they maximize the squared covariance 
to y and that all components are mutually orthogonal. Orthogonality can be ensured by 
deflating the original variables X 

Xi = X — T , t 1 ,...,u_ 1 X , 

and then replacing X by Xi in ([2]). The matrix W = (w\, . . . ,w m ) £ R ixm can be shown 
to be orthogonal as well (e.g. [9]). Note furthermore that the latent components are usually 
scaled to unit norm. Kernel PLS \20\ [22] can be derived by noting that Wi = X T r{ with 

n = (y-y^/WK^iy-y^W (3) 

denoting the normalized residuals, and by deflating the kernel matrix K instead of X, 

= (4-^)^(4-^) . 

In contrast to e.g. Principal Component Analysis, the latent components T depend on the 
response, and hence the fitted values ([I]) are a nonlinear function of y. 

Recall that in the nonlinear case, KPLS depends on the kernel parameters (e.g. the 
width of an rbf-kernel) and the optimal number m of latent components. Thus, for model 
selection, one has to select the optimal combination on a grid of possible kernel parameters 
and components from 1 to a maximal amount m* of components. 

2.1 KRYLOV METHODS AND LANCZOS APPROXIMATION 

To predict the output for a new observation, we have to derive the regression coefficients /3 m 
(in the linear case) and kernel coefficients a m (in the nonlinear case), which are defined via 
y m = X(3 m = Ka m . This can be done by using the fact that PLS is equivalent to the Lanczos 
bidiagonalization of X [14] : The orthogonal matrices T and W represent a decomposition of 
X into a bidiagonal matrix L £ jj mxm v ia 

XW = TL (4) 

with lij = for i > j or i < j — 1. This matrix is defined as L = T T XXV. This implies 
[El El [22] 

3 m = WL^y and a m = RL^T T y 

with R = (n, . . . ,r m ) G M. nxm . Furthermore, it can be shown [18] that PLS is equivalent to 
the conjugate gradient (CG) algorithm [8]. The latter is a procedure that iteratively computes 
approximate solutions of the normal equation A/3 = b (with A = X T X and b = X T y) by 
minimizing the quadratic function 1/2/3 T A/3 — (3 b along directions that are A-orthogonal. 
These search directions span the Krylov space defined by A and b. The approximate solution 
of CG obtained after m steps is equal to the PLS estimate /3 m with m components. Moreover, 
the weight vectors W are an orthogonal basis of JC m (A, b). 

Krylov methods are also used to approximate eigenvalues of A by "restricting" A onto 
Krylov subspaces: In terms of the orthogonal basis W of /C m (A, b), the map 

D = ^/C m (A,b)^/C m (Ab) 
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is represented by 



D = WAW. 



(5) 



D is shown to be tridiagonal, and the m distinct eigenvalues fj,\ > > • • • > A*m oi D - 
called Ritz values - are good approximations of the eigenvalues of A [21]. One immediate 
consequence of the connection between PLS and Krylov spaces is the fact that the latent 
components span the Krylov space defined by K and Ky. This implies that 

Vm = V Km{K ^ Ky) y. (6) 
2.2 SENSITIVITY ANALYSIS FOR KPLS 

Sensitivity measures are crucial in at least two important scenarios. On the one hand, they 
are needed to select the correct model (in terms of a suitable kernel and the number of 
components) when using information criteria. On the other hand, to assess the stability of 
the solution, one needs to measure the influence of small noise in the training points on the 
learned function. For example, areas with a high sensitivity require further data points to 
stabilize the solution in an ambiguous area. Furthermore, if for some regions, the prediction 
does not depend on the training points at all, this indicates that further data points are 
necessary. 

Both of these questions - model selection and stability analysis - can be addressed by 
computing the derivatives of the KPLS solution with respect to y, either of the fitted labels 
y m , or of the learned kernel coefficients a m . Let us consider the regression model 

Y t = f(x i )+e i ,e l r,M(0,a 2 ). (7) 

For a general regression method with fitted values y, the degrees of freedom are defined as 

pniE] 

DoF = E [trace (dy/dy)} 

with the expectation E taken with respect to Y\, . . . , Y n . An unbiased plug-in estimate of the 
degrees of freedom is therefore given by 

DoF = trace (dy/dy) . (8) 

Degrees of freedom in combination with information criteria can be used for model selection. 
As the KPLS solution depends nonlinearly on y, the computation of the derivative is necessary. 
Kramer & Braun [13] derive an algorithm for the derivative of y m by transforming the Lanczos 
decomposition ([!]) into a Kernel representation and by exploiting its sparsity. The resulting 
iterative algorithm for ([8|) is then used successfully for model selection. This method scales 
cubically in the number of examples. 

For the construction of confidence intervals for a fitted kernel function 

n 

f{x) = ^2aik(x,Xi) . 
i=l 

one needs to study the influence of an infinitesimal perturbation in the values of y. If the 
kernel coefficients depended linearly on y via a = Hy, the distribution of the prediction f(x) 
at any point x would be given by 

f(x) - Af(k{x) T E{a] ,a 2 k{x) T HH T k(x)^j (9) 
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with k(x) = (k(x, x\), . . . , k(x, x n )) £ R n . However, as KPLS depends nonlinearly on y, 
the distribution of S. m can only be determined approximately by using a first order Taylor 
expansion, i.e. one uses 

H m « (da m /dy) . (10) 

To the best of our knowledge, confidence intervals for PLS have only been constructed in the 
linear setting, but the results can easily been extended to the Kernel case. Phatak et. al. 
[T9] use ([6]) to explicitly calculate the derivative of the PLS coefficients (3 m , and obtain 
an approximate distribution of (5 m . As the formula depends on matrix multiplications of 
order (nm) x (nm), this approach is computationally expensive. Furthermore, as the Krylov 
sequence Ky, . . . ,K m y is nearly collinear, the formula is numerically unstable. In [SJ 126] . 
an iterative formulation of PLS is used to construct the derivative of j3 m . Finally, we remark 
that the approach by [13] using the Lanczos decomposition can be extended to the derivative 
of the kernel coefficients. 

The drawback of all of these approaches is their poor scalability. All algorithms are cubic 
in the number of observations. In the following two sections we exploit that we do not need 
the derivative itself, but only the trace of the derivative for the degrees of freedom, and ([9]) 
for the construction of confidence intervals. The key advantage is that we can compute these 
approximation schemes in quadratic runtime. 



3 APPROXIMATE DEGREES OF FREEDOM IN QUADRATIC 
RUNTIME 

The key ingredients for the derivation of approximate degrees of freedom are (1) the identi- 
fication of those terms that are cubic in n, and (2) the approximation of those terms using 
Lanczos approximations. 

First, we extend the results of [19] to the computation of the derivative of y m . We define 
the m x m matrix B via by = (ti,K^y) . The matrix is regular and upper triangular, as the 
latent components T are an orthogonal basis of the Krylov subspace IC m (K, Ky). 

Proposition 1. Let c = B~ 1 T T y and V = {v\, . . . ,v m ) = TB T . We have 
dy m 



dy 



\c T ® (J n - Vt)] Q t + [v <g> (y - y n 



c 3 [In ~ TT T j Ki + Vj (V ~ Vm) T JO + TT T . 

3=1 3=1 



Here, <g> is the Kronecker product and Q = (K, K 2 , . . . , K m ) £ ]^nxnm^ 

Proof. The first line follows by computing the derivative of the projection operator ([6]) and by 
applying a basis transformation from the Krylov sequence Ky, . . . , K m y to the orthogonal 
basis ti, . . . ,t m . The latter ensures that the formula is numerically more stable. For the 
second line, we represent the Kronecker product as a sum. □ 

As a consequence, we yield a formula for the degrees of freedom of KPLS. 
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Corollary 2. An unbiased estimated of the degrees of freedom of KPLS with m components 
is given by 



m m 

DoF(m) = E c i trace [( J ™ " TTT ) Rj ] + E & " ^ m ) T + m 



m 

= Y^ c j trace + m 

m / m \ m. 

-EE *' Tkj *' + - ^™) T E 
j=i v«=i / j=i 

This representation of the DoF of KPLS reveals an interesting feature. The computation 
of last line is quadratic in n, as it only involves matrix-vector multiplications. The first line 
however is cubic in n, as we need to compute the trace of powers of the kernel matrix K 3 for 
j = 1, . . . ,m. 

3.1 APPROXIMATE DEGREES OF FREEDOM VIA RITZ VALUES 

As explained above, PLS is equivalent to Lanczos approximations, and can be used to ap- 
proximate the eigenvalues of X T X via the tridiagonal matrix D defined in ([5]). Note that D 
has a kernel representation 

D = R T K 2 R = L T L. (11) 

with R the matrix of normalized residuals defined in (|3|). The eigenvalues of D are called 
Ritz values and constitute approximations of the eigenvalues of X T X [23]. The quality of the 
approximation increases with the number m of latent components, and as the computation of 
the Ritz values scales cubically only in m, an efficient strategy is to allow a generous amount 
of components for the computation of D. 

As the eigenvalues of X T X correspond to the eigenvalues of K in the kernel setting, we 
can use Ritz values to derive an approximation of the trace of K J . 

Definition 3 (Approximate Degrees of Freedom). We define the approximate degrees of 
freedom of KPLS with m components as 

m 

DoF appr (m) = E c i trace { D L W ) + m 

3=1 

m / m \ m 

-EE tiTRHi ) + (v- ym) T E Kiv i > 

j=l \l=l J j=l 

where D mmax ^ s the tridiagonal matrix defined in Ul\) computed with m max > m latent com- 
ponents. 

The computation of D only requires one additional m mSuX x m max matrix multiplication 
L L. As the matrix D is of size m maiX x m max , the runtime for the computation is cubic 
in the number of maximal components J^ max (which is typically small), and quadratic in the 
number n of examples. 
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3.2 QUALITY OF THE APPROXIMATION 

Theoretically, the validity of this approximation can be justified in terms of a deviation bound. 

Proposition 4 (Saad |24|). Denote by fi±, . . . , fi m the eigenvalues of D and by X\ > . . . > A n 

the eigenvalues of the Kernel matrix K . We have 



< Xi - m < (Ai - A n ) 
with Ui the ith eigenvector of K, 



Kj tan 9i 



acos 



C m _ 4 (l + 2 7i ) 



\y\\K 



and 



i-l 

n 



— X n Xi — Xi-i 



7, 



i ~ ^ Aj+i — A n 

Here, C\ denotes the Chebychev polynomial of order I. 

Note that Oi is the angle between b and the ith eigenvector of X T X - computed in feature 
space. This inequality implies that the approximation for the ith eigenvalue is good under 
two different scenarios. Either Xi is already close to zero, so \i% < Xi is close to zero as well. 
For large eigenvalues Aj, the approximation is good if (a) the eigenvalues of K decay fast, (b) 
the angle Oi corresponding to the ith eigenvector is small, and (c) the index i is not too large 
compared to m. Property (a) is a feature of rbf- kernels, which we use throughout the rest 
of the paper. Condition (b) is typically fulfilled for the leading eigenvectors of K [21 [3], and 
condition (c) can be fulfilled by using a sufficient large amount m max of components. 

In practical applications, two important issues are the quality of the approximate degrees 
of freedom, and the quality of the model selection criteria based on these approximate degrees 
of freedom. In accordance with [13] . we choose generalized minimum description length 
(gMDL) [7] as model selection criterion. 

The simulation setting follows the regression model ([7]) with f(x) = sinc(x). We draw 
n = 100 inputs Xi uniformly from [— it, tt] and set the standard deviation to a = 0.1. We fit 
KPLS with three different rbf-kernels of width 0.01,0.1, 1 and use different numbers m max of 
maximal components. In addition, we compute the DoF, the approximate DoF, the gMDL 
criterion, and gMDL based on the approximate DoF. 

Figure []] displays the results for the different kernel widths 0.01 (left), 0.1 (center) and 
1 (right). The first row shows the degrees of freedom of KPLS (blue line) and approximate 
degrees of freedom of KPLS depending on the number of maximal components (red dashed 
line). As indicated by Proposition [U the approximation becomes more accurate if m mSuX is 
large. Furthermore, the approximation depends on the width of the rbf-kernel. For very 
small kernel widths (left), the eigenvalues of the kernel matrix decay very slowly, and more 
components are needed to compensate. In the second line of Figure [H we display gMDL 
(blue line) and the approximate gMDL depending on the number of maximal components 
(red dashed line). The behavior of the approximation is qualitatively the same: It depends 
on the size of the kernel widths, and in general, it becomes more accurate if more components 
are used to compute D. 
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number of components number of components number of components 




number o( components number of components number of components 



Figure 1: Quality of the approximate degrees of freedom. Results for kernel widths 0.01 
(left), 0.1 (center), and 1 (right). Top row: DoF (blue line) and approximate DoF (red 
dashed line) for different numbers of maximal components. Bottom row: gMDL (blue line) 
and approximate gMDL (red dashed line) for different numbers of maximal components. 



3.3 RUNTIME COMPARISON 

As shown above, the approximation of the Degrees of Freedom of KPLS leads to reduction in 
runtime from cubic to quadratic. We now illustrate that this leads to a considerable speed up 
even for medium sized data. We used the kin regression data set from the delve repository. 
This eight-dimensional synthetic data set is based on a model of a robotic arm, and the task 
consists in predicting the position of the arm based on the angles of its joints. It consists of 
8192 data points. For sub-samples of size 100, 200, . . . , 1000, we compute (a) KPLS and its 
Degrees of Freedom for up to m = 10 components and (b) KPLS and its approximate Degrees 
of Freedom for up to m = 10 components. In both cases, we use a Gaussian Kernel. Note 
that for (b), we compute m max = 30 components in order to obtain a close approximation, 
hence the number of KPLS iterations is three times higher for alternative (b). The runtime 
of both variants are displayed in Fig. [2j The gap between the two methods is clearly visible 
already for small sample sizes, and the two graphs show the expected quadratic versus cubic 
form. While the latter is an empirical illustration of the theoretical runtime analysis that we 
present above, it is important to stress that the improvement from O (n 3 ) to O (ra 2 ) is not an 
asymptotic result but also leads to a significant improvement in runtime already for medium 
sized data. 

1 http : //www. cs . toronto . edu/ "delve/ 



8 




Figure 2: Comparison of runtime on the "kin" data set. Jagged line: KPLS with exact 
Degrees of Freedom for m = 10 components. Solid line: KPLS with approximate Degrees 
of Freedom for m = 10 components and m max = 30 components for the approximation of 
the eigenvalues of the kernel matrix. Hence, for the approximation, the effective number of 
components is three times higher. 

4 CONFIDENCE INTERVALS IN QUADRATIC RUNTIME 

For the derivation of (approximate) confidence intervals ([9]), we need to compute the quantity 
H m k(x), where H m y is the first order Taylor approximation of the kernel coefficients a m . 
Using the representation from proposition [H we can directly compute this matrix-vector 
product, even without approximating the eigenvalues and thus compute the exact expression 
in quadratic runtime. 

Note that Taylor expansions occur in both types of approximations, for the Degrees of 
Freedom as well as for the confidence intervals. However, there are essential conceptual 
differences. For the Degrees of Freedom, the representation in terms of derivatives ([8]) is in 
fact no approximation but due to the assumption of normally distributed errors in ([7J) which 
leads to Stein's Lemma [27j. In this case, the Degrees of Freedom are approximated using 
Lanczos methods and Ritz values. In contrast, for the confidence intervals, we have to use the 
Taylor expansion (|10p to obtain an approximate distribution ([9]) for the KPLS parameters. 
The computation of the Taylor expansion H m defined in (|10p is cubic in n as it involves 
multiplications of matrices of size nxn. Here, we reduce the computational cost to O (n 2 ) by 
cleverly exploiting the fact that the matrix-vector product H m k(x) is a sufficient statistic. 

Proposition 5. We have 

m 

H^fe(aj) = { C 'J { In ~ KTN R T ) 

+ K{y- y m ) ttj} k(x) + TNR T k(x) . 

with R denoting the matrix of normalized residuals, N denoting the m x m diagonal matrix 
consisting of elements na = 1/||-K>;|| and 

U = u m ) = RNB T . 
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KPLS with kernel width 0.1 and 1 5 components KPLS with kernel width 1 and 9 components 




Figure 3: Confidence Intervals for KPLS. Left: KPLS with 15 components and an rbf-kernel 
of width 0.1 Right: KPLS with 9 components and an rbf kernel of width 1 

Proof. As 

dy m /dy = K (da m /dy) , 

the formula can be shown by "canceling out" K in the formula of the derivative of y m , and 
then multiplying the formula with k(x). □ 

Illustration Again, we use the regression model ([7]), with f(x) = (x — l)(x + 2)(x — 
1.5) exp(— x 2 /10) and a = 1. We draw n = 40 points X{ from a mixture of two normal 
distribution with mean —2 and 3 a variance of 1 in both cases. We fit KPLS with for two 
different models, (1) KPLS with 15 components and an rbf-kernel of width 0.1 and (2) KPLS 
with 9 components and an rbf kernel of width 1. Figure [3] shows the KPLS fit and its 
confidence intervals (based on a level of 98%) for the two models. 

In areas with high data density, the prediction is quite stable with small confidence inter- 
vals. Next to such high density areas, the predictions becomes unstable, as they can depend 
quite sensitively on the neighboring data. Finally, when one moves far away from the data 
points, their influence decreases to zero. This is much more apparent in the left plot with the 
small kernel widths. 

5 CONCLUSION 

We proposed an implementation of the Kernel PLS method which not only computes the 
fit in quadratic time, but a degree-of-freedom estimate and confidence intervals based on a 
sensitivity analysis, which formerly required cubic runtime. The latter estimates can be used, 
for example, for model selection, or to measure the local stability of the learned function. The 
approximation schemes exploit the fact that Kernel PLS can be extended to compute Lanczos 
type approximations of the eigenvalues as well. Together with a novel formula for computing 
the derivatives of the kernel parameters ct, these approximations allow us to replace costly 
computation of powers of the kernel matrix. In summary, one obtains a Kernel PLS algorithm 
which also provides relevant additional information for model selection and provide further 
insight into the complexity and stability of the learned function. 



10 



Our results capitalize on the close connection between the dimensionality reduction tech- 
nique PLS on the one hand and Krylov methods and Lanczos approximations on the other 
hand. While the latter two methods are commonly used in numerical linear algebra, their ben- 
efits for data analysis have not yet been exploited sufficiently. Only recently (e.g. [T?] HI [TU]) 
they are utilized explicitly in a machine learning framework. Recent research results on the 
correspondence of penalization techniques and preconditioning of linear systems [T2] fur- 
ther underpin the strong potential of these methods. We strongly believe that the interplay 
between numerical linear algebra and machine learning will further stimulate the field of data 
analysis. 
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